1 Executive Summary

In this exercise, we analyse listings from Airbnb bookings (via insideairbnb.com) to better understand price patterns within a particular city. The city we examined was Cape Town and we tried to predict the total cost of four nights for two people To kick things off, we performed EDA to examine the data in more detail. Next, we subset the data and performed feature engineering to create new features which may suitable for further modeling. From this exercise, we move to regression analysis where we iteratively test out a few indicators to see if they are significant in explaining the variability within the dataset. We build up on this to create a final model of 7 covariates (property type, room type, review ratings scores, cancellation policy, how many people the property accommodates, which neighbourhood the property is in and if the property has air conditioning installed. This model achieves a R squared of 0.512. Finally, we use this model to make a prediction on unseen problem.

2 Exploratory Data Analysis

2.1 Data background

This data set contains information on publicly available Airbnb listings in the city of Cape Town, South Africa. It was retreived from insideairbnb.com Inside Airbnb is an independent, non-commercial website that offers a set of tools and data that allows you to explore how Airbnb is being used in cities around the world. The original data was scraped from the the Airbnb website (www.airbnb.com). The raw scraping data was then analyzed, cleaned and aggregated where appropriate to faciliate the public discussion and made available here: http://data.insideairbnb.com/south-africa/wc/cape-town/2020-06-21/data/listings.csv.gz

2.2 Looking at the raw values

First, we will take a look at our data.

glimpse(data)
## Rows: 24,062
## Columns: 106
## $ id                                           <dbl> 3191, 15007, 15068, 1507…
## $ listing_url                                  <chr> "https://www.airbnb.com/…
## $ scrape_id                                    <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped                                 <date> 2020-06-23, 2020-06-24,…
## $ name                                         <chr> "Malleson Garden Cottage…
## $ summary                                      <chr> "This is a lovely, separ…
## $ space                                        <chr> "Beautiful self catering…
## $ description                                  <chr> "This is a lovely, separ…
## $ experiences_offered                          <chr> "none", "none", "none", …
## $ neighborhood_overview                        <chr> "Mowbray is on the South…
## $ notes                                        <chr> "There are pets on the p…
## $ transit                                      <chr> "The train station is ab…
## $ access                                       <chr> "Fully self-contained co…
## $ interaction                                  <chr> "I live in the main hous…
## $ house_rules                                  <chr> "There is no smoking in …
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url                                  <chr> "https://a0.muscache.com…
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id                                      <dbl> 3754, 59072, 59318, 5934…
## $ host_url                                     <chr> "https://www.airbnb.com/…
## $ host_name                                    <chr> "Brigitte", "Dirk", "Lin…
## $ host_since                                   <date> 2008-10-21, 2009-12-01,…
## $ host_location                                <chr> "Cape Town, Western Cape…
## $ host_about                                   <chr> "I'm single and love to …
## $ host_response_time                           <chr> NA, "within a day", NA, …
## $ host_response_rate                           <chr> NA, "100%", NA, "100%", …
## $ host_acceptance_rate                         <chr> "100%", "100%", "33%", "…
## $ host_is_superhost                            <lgl> TRUE, TRUE, FALSE, FALSE…
## $ host_thumbnail_url                           <chr> "https://a0.muscache.com…
## $ host_picture_url                             <chr> "https://a0.muscache.com…
## $ host_neighbourhood                           <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_listings_count                          <dbl> 1, 11, 13, 5, 1, 1, 3, 1…
## $ host_total_listings_count                    <dbl> 1, 11, 13, 5, 1, 1, 3, 1…
## $ host_verifications                           <chr> "['email', 'phone', 'rev…
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified                       <lgl> TRUE, TRUE, FALSE, TRUE,…
## $ street                                       <chr> "Southern Suburbs, Weste…
## $ neighbourhood                                <lgl> NA, NA, NA, NA, NA, NA, …
## $ neighbourhood_cleansed                       <chr> "Ward 57", "Ward 23", "W…
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, …
## $ city                                         <chr> "Southern Suburbs", "Cap…
## $ state                                        <chr> "Western Cape", "Western…
## $ zipcode                                      <dbl> 7700, 7441, 7441, 7441, …
## $ market                                       <chr> "Cape Town", "Cape Town"…
## $ smart_location                               <chr> "Southern Suburbs, South…
## $ country_code                                 <chr> "ZA", "ZA", "ZA", "ZA", …
## $ country                                      <chr> "South Africa", "South A…
## $ latitude                                     <dbl> -33.9, -33.8, -33.8, -33…
## $ longitude                                    <dbl> 18.5, 18.5, 18.5, 18.5, …
## $ is_location_exact                            <lgl> TRUE, TRUE, FALSE, TRUE,…
## $ property_type                                <chr> "Guesthouse", "House", "…
## $ room_type                                    <chr> "Entire home/apt", "Enti…
## $ accommodates                                 <dbl> 2, 6, 6, 2, 2, 2, 4, 3, …
## $ bathrooms                                    <dbl> 1.0, 3.0, 2.0, 1.5, 1.0,…
## $ bedrooms                                     <dbl> 1, 3, 3, 1, 1, 1, 2, 1, …
## $ beds                                         <dbl> 1, 4, 5, 2, 1, 1, 4, 1, …
## $ bed_type                                     <chr> "Real Bed", "Real Bed", …
## $ amenities                                    <chr> "{TV,\"Cable TV\",Wifi,\…
## $ square_feet                                  <dbl> NA, NA, NA, NA, NA, NA, …
## $ price                                        <chr> "$815.00", "$2,038.00", …
## $ weekly_price                                 <chr> "$5,204.00", "$9,000.00"…
## $ monthly_price                                <chr> "$20,816.00", "$40,000.0…
## $ security_deposit                             <chr> NA, "$2,500.00", "$3,000…
## $ cleaning_fee                                 <chr> "$260.00", "$850.00", "$…
## $ guests_included                              <dbl> 2, 6, 1, 2, 2, 3, 2, 1, …
## $ extra_people                                 <chr> "$0.00", "$250.00", "$0.…
## $ minimum_nights                               <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ maximum_nights                               <dbl> 730, 120, 730, 1125, 730…
## $ minimum_minimum_nights                       <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ maximum_minimum_nights                       <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ minimum_maximum_nights                       <dbl> 730, 120, 730, 1125, 730…
## $ maximum_maximum_nights                       <dbl> 730, 120, 730, 1125, 730…
## $ minimum_nights_avg_ntm                       <dbl> 3.0, 2.0, 4.0, 5.0, 2.0,…
## $ maximum_nights_avg_ntm                       <dbl> 730, 120, 730, 1125, 730…
## $ calendar_updated                             <chr> "7 months ago", "4 month…
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30                              <dbl> 30, 0, 30, 0, 30, 30, 0,…
## $ availability_60                              <dbl> 60, 0, 60, 0, 60, 60, 0,…
## $ availability_90                              <dbl> 90, 0, 90, 0, 90, 90, 0,…
## $ availability_365                             <dbl> 365, 0, 365, 234, 365, 3…
## $ calendar_last_scraped                        <date> 2020-06-23, 2020-06-24,…
## $ number_of_reviews                            <dbl> 54, 27, 0, 5, 0, 2, 8, 3…
## $ number_of_reviews_ltm                        <dbl> 6, 1, 0, 0, 0, 0, 1, 0, …
## $ first_review                                 <date> 2013-05-31, 2013-12-15,…
## $ last_review                                  <date> 2020-03-07, 2019-08-18,…
## $ review_scores_rating                         <dbl> 97, 97, NA, 100, NA, 90,…
## $ review_scores_accuracy                       <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ review_scores_cleanliness                    <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ review_scores_checkin                        <dbl> 10, 10, NA, 10, NA, 9, 1…
## $ review_scores_communication                  <dbl> 10, 10, NA, 10, NA, 10, …
## $ review_scores_location                       <dbl> 10, 10, NA, 10, NA, 10, …
## $ review_scores_value                          <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, FAL…
## $ license                                      <dbl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable                             <lgl> TRUE, FALSE, FALSE, FALS…
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy                          <chr> "moderate", "strict_14_w…
## $ require_guest_profile_picture                <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification             <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count               <dbl> 1, 4, 10, 5, 1, 1, 3, 1,…
## $ calculated_host_listings_count_entire_homes  <dbl> 1, 4, 10, 1, 1, 1, 2, 1,…
## $ calculated_host_listings_count_private_rooms <dbl> 0, 0, 0, 4, 0, 0, 0, 0, …
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month                            <dbl> 0.63, 0.34, NA, 0.06, NA…

In the original dataset, there are 24062 observations and 106 variables. There are 41 numerical variables and 43 categorical/factor variables as seen in the table above.

2.3 Variable Selection

In order to fit our model, we will remove columns that are redundant, of poor data quality, or that cannot be used to explore our dependent variable, price for four nights for 2 people.

#Vector of all columns we want to exclude
columns_deselect <- c(3:19, 20:22, 24:25, 27:28, 30:36, 38:39, 41:48, 58, 60, 62:63, 69:82, 84, 94:95, 96, 98, 100:105)

#Deselecting all columns we dont need
listings <- data %>%
  select(-columns_deselect)

In the table below, we outline our reasoning for removing some variables.

## Rows: 70
## Columns: 2
## $ `Variable Name` <chr> "scrape_id", "last_scraped", "name", "summary", "spac…
## $ Reason          <chr> "The scrape ID is the same for every entry.", "Date o…
kable(deselected_variables[1:2]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, font_size = 14, position = "center") %>%
  scroll_box(width = "910px", height = "400px")
Variable Name Reason
scrape_id The scrape ID is the same for every entry.
last_scraped Date of scraping not relevant for further data analysis.
name All of the variables are different and only text data.
summary Host description for their listing in text form. Would be difficult to extract information.
space Host description for their listing in text form. Would be difficult to extract information.
description Host description for their listing in text form. Would be difficult to extract information.
experiences_offered All values are “none”.
neighborhood_overview Host description for neighborhood in text form. Would be difficult to extract information.
notes Additional information provided by host in text form. Would be difficult to extract information.
transit Host description for transit in text form. Would be difficult to extract information.
access Host description for listing access in text form. Would be difficult to extract information.
interaction Host description for interaction with guests in text form. Would be difficult to extract information.
house_rules Host description for house rules in text form. Would be difficult to extract information.
thumbnail_url Url not relevant for further data analysis.
medium_url Url not relevant for further data analysis.
picture_url Url not relevant for further data analysis.
xl_picture_url Url not relevant for further data analysis.
host_id All are unique values, therefore is redundant since we are keeping listing ID
host_url Url not relevant for further data analysis.
host_name Host name is redundant with listing ID
host_location Variable is inconsistently formatted and not relevant for further data analysis.
host_about Host introduction in text form. Would be difficult to extract information.
host_response_rate Most of the variables are N/A or 100%, therefore would not be useful in explaining our dependent variable.
host_acceptance_rate The distribution is very heavily left skewed and contains many N/A values.
host_thumbnail_url Url not relevant for further data analysis.
host_picture_url Url not relevant for further data analysis.
host_neighbourhood Only contains N/A values.
host_listings_count Captured in calculated_host_listings_count, thus is redundant.
host_total_listings_count Captured in calculated_host_listings_count, thus is redundant.
host_verifications Type of verififcation irrelevant for analysis.
host_has_profile_pic Out of 17,300 ppl, only 50 have no picutre.
street Street name is too granular, we will use neighborhood_cleansed to account for location in our analysis.
neighbourhood Only contains N/A values.
neighbourhood_group_cleansed Only contains N/A values.
city Data quality is poor, as there are typos and the majority of entries are in Cape Town.
state Data quality is poor, as there are typos and different formats. The majority of entries are in the same state.
zipcode Zip code is too granular, we will use neighborhood_cleansed to account for location in our analysis.
market Market is almost always Cape Town.
smart_location Data quality is poor, with cities spelled in different ways.
country_code All listings are in South Africa with the exception of one listing in Congo. This is obviously a mistake, as other data like latitude or longitude clearly shows the listing is in Cape Town. Hence we can delete the column.
country All listings are in ZA with the exception of one listing in CG. This is obviously a mistake, as other data like latitude or longitude clearly shows the listing is in Cape Town. Hence we can delete the column.
bed_type The vast majority of listings (24,029) only contain one type of bed: “Real Bed”
square_feet Data is incomplete. 99.8% are missing values.
weekly_price Since we are only looking at rentals for <=4 days, the weekly price is irrelevant.
monthly_price Since we are only looking at rentals for <= 4 days, the monthly price is irrelevant.
maximum_nights We are looking at listings where minimum nights is <= 4, thus maximum nights is irrelevant.
minimum_minimum_nights This is a forward-looking variable for the smallest minimum nights available looking 1 year out. This is irrelevant for our analysis.
maximum_minimum_nights This is a forward-looking variable for the largest minimum nights available looking 1 year out. This is irrelevant for our analysis.
minimum_maximum_nights This is a forward-looking variable for the smallest maximum nights available looking 1 year out. This is irrelevant for our analysis.
maximum_maximum_nights This is a forward-looking variable for the largest maximum nights available looking 1 year out. This is irrelevant for our analysis.
minimum_nights_avg_ntm This is a forward-looking variable for the average minimum nights available looking 1 year out. This is irrelevant for our analysis.
maximum_nights_avg_ntm This is a forward-looking variable for the average maximum nights available looking 1 year out. This is irrelevant for our analysis.
calendar_updated Irrelevant for future analysis as it only indicates when the listing calendar was last updated.
has_availability Only has one unique value: TRUE
availability_30 This is a forward-looking variable for the availability of the listing x days into the future. This is irrelevant for our analysis.
availability_60 This is a forward-looking variable for the availability of the listing x days into the future. This is irrelevant for our analysis.
availability_90 This is a forward-looking variable for the availability of the listing x days into the future. This is irrelevant for our analysis.
availability_365 This is a forward-looking variable for the availability of the listing x days into the future. This is irrelevant for our analysis.
calendar_last_scraped Irrelevant for future analysis as it only indicates when the calendar was last scraped.
number_of_reviews_ltm Represents the number of reviews in the last twelve months. It is highly correlated with number_of_reviews. We decided to keep number_of_reviews because that is how the user interface shows it on the Airbnb website.
requires_license Contains only N/A values.
license 99.9% incomplete.
jurisdiction_names Contains only N/A values.
is_business_travel_ready Only has one unique value: FALSE
require_guest_profile_picture 99.9% are FALSE.
require_guest_phone_verification Around 99% are FALSE.
calculated_host_listings_count Most of the entries are 1 with some extreme outliers.
calculated_host_listings_count_entire_homes Most of the entries are 1 with some extreme outliers.
calculated_host_listings_count_private_rooms Most of the entries are 1 with some extreme outliers.
calculated_host_listings_count_shared_rooms Most of the entries are 1 with some extreme outliers.

2.4 Data Wrangling

In order to run models with our data, we need to modify some variables and create some new variables that we want to examine.

First, we must change the price variables (price, cleaning_fee, extra_people, and security_deposit) to numeric variables, as they are now characters.

# Changing price variables to numeric 
listings <- listings %>% 
  mutate(price = parse_number(price),
    cleaning_fee = parse_number(cleaning_fee),
    extra_people = parse_number(extra_people),
    security_deposit = parse_number(security_deposit))

# Check if these variables are numeric, whether we have any zero prices for `price` or there are any NAs
listings %>% 
  select(price, cleaning_fee, extra_people) %>% 
  skim()%>%
  kable()%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric price 0 1.000 2470.0 7527 0 659 1058 2003.0 300009 ▇▁▁▁▁
numeric cleaning_fee 7049 0.707 338.4 512 0 100 300 450.0 40000 ▇▁▁▁▁
numeric extra_people 0 1.000 83.8 220 0 0 0 96.5 4681 ▇▁▁▁▁

We will also change host_response_time to a factor variable, that is ordered based on how long a host takes to respond. That way we can determine whether there’s a relationship between response time length and price.

# Changing host_response_time to a factor variable 

listings <- listings%>% 
  mutate(host_response_time = factor(host_response_time, 
  order = TRUE, levels = c("within an hour", "within a few hours", "within a day",
                           "a few days or more")))

We will also turn the listing ID’s into character strings rather than numeric values.

#Turn IDs into string

listings <- listings %>% 
  mutate(id = as.character(id),
         host_id = as.character(id))

We create a new variable called host_experience_y which tells us the number of years a host is active on Airbnb. We want to examine how long a host has been active impacts how much they might charge for a listing.

#max_date in host_since = 2020-06-17

# Creating a new variable call host_experience_y which tells us about the no of years the host is active on Airbnb

listings <- listings %>% 
  mutate(host_experience_y = as.numeric(ymd("2020-06-17") - host_since)/365)

For security_deposit, since there are 8559 missing values, we will treat the missing values as having no security deposit. We will also convert to a dummy variable where if the listing has a security deposit the variable will be TRUE, an if not, FALSE. We do this because there is a massive range of security deposits and the distribution is very skewed. This will help us determine whether having a security deposit influences the price.

listings%>%
  skim(security_deposit)%>%
  kable()%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric security_deposit 8559 0.644 2770 6319 0 0 1478 3000 150000 ▇▁▁▁▁
kable(favstats(~security_deposit, data=listings))%>%
  kable_styling()
min Q1 median Q3 max mean sd n missing
0 0 1478 3000 150000 2770 6319 15503 8559
#turning security_deposit into a dummy variable 
listings <- listings %>%
  mutate(security_deposit = case_when(
    is.na(security_deposit) ~ 0,
    TRUE ~ security_deposit
  )) %>%
  mutate(security_deposit=case_when(
    security_deposit==0~FALSE,
    TRUE~TRUE
  ))

Next, we look at cancellation_policy. It looks like there are several types of cancellation policies that only have one or 2 listings with that category based on the count below. Additionally, to simplify, we can group all of the policies into three main categories: flexible, moderate, and strict.

listings %>%
  count(cancellation_policy)%>%
   kable()%>%
  kable_styling()
cancellation_policy n
flexible 7822
luxury_moderate 59
luxury_super_strict_125 1
luxury_super_strict_95 9
moderate 5620
strict 2
strict_14_with_grace_period 10101
super_strict_30 397
super_strict_60 51
#We first convert cancellation policy into a 3-category variable where cancellation policy can be flexible, moderate, and strict. 

listings<- listings %>%
  mutate(
    cancellation_policy = case_when (cancellation_policy %in% c("flexible") ~"flexible", cancellation_policy %in% c("luxury_moderate", "moderate") ~"moderate", 
cancellation_policy %in%c("luxury_super_strict_125","luxury_super_strict_95",
                          "strict","strict_14_with_grace_period","super_strict_30",
                          "super_strict_60")~"strict"),
  )

We want to examine whether having air conditioning affects the listing price, as we noticed that only around 30% of properties have air conditioning installed. We introduce a new variable that evaluates to TRUE if string ‘Air conditioning’ is included in amenities, and FALSE otherwise.

#creating the A/C variable
listings <- listings %>% 
  mutate(ac=str_detect(amenities,'Air conditioning'))

proportions<-listings %>% 
  group_by(ac) %>% 
  summarise(count=n()) %>% 
  mutate(proportion=count/sum(count))
proportions%>%
    kable()%>%
  kable_styling()
ac count proportion
FALSE 17127 0.712
TRUE 6935 0.288

It seems that having Air conditioning in Cape Town is some sort of a luxury. Furthermore, according to website Statista, only 6% of households in South Africa have air conditioning (Global AC penetration rate by country 2016 | Statista, 2020).

We will also convert the amenities variable to be a count of the number of amenities in a certain property, rather than comma separated values.

listings<- listings %>% 
  mutate(total_amenities=str_count(amenities, ',')+1)

Additionally, we simplify the neighbourhood_cleansed by condensing them into the new variable neighborhood_simplified. Cape Town is separated into over 100 Wards, which would be difficult to include in our model. Through our research, we found that Cape Town is also segmented into 24 sub-councils. We assigned each ward to one of the 24 sub-councils following official data from capetown.gov.za. Then, based on an interview with a Cape Town local, we assigned each sub-council a grade corresponding to cost of living and quality of life, with the levels: best, better, good, and standard.

# Cape Town is made up of 100+ diffeerent Wards. 
# The column neighborhood_cleansed contains information on which Ward the listing is located in
# We first assign each Ward to 1 of the 24 sub-councils that make up Cape Town, following official data from the Cape Town website
# (http://resource.capetown.gov.za/documentcentre/Documents/Maps%20and%20statistics/Population_and_Households_by_Ward_2001_and_2011.pdf)



#Vector for each sub district

SC01 <-c("Ward 23", "Ward 29",  "Ward 32",  "Ward 104")
SC02 <-c("Ward 6",  "Ward 7",   "Ward 8",   "Ward 101", "Ward 102", "Ward 111")
SC03 <-c("Ward 1",  "Ward 4",   "Ward 5",   "Ward 70",  "Ward 107", "Ward 113")
SC04 <-c("Ward 25", "Ward 26",  "Ward 27",  "Ward 28",  "Ward 30")
SC05 <-c("Ward 13", "Ward 20",  "Ward 24",  "Ward 31",  "Ward 50",  "Ward 106")
SC06 <-c("Ward 2",  "Ward 3",   "Ward 9",   "Ward 10",  "Ward 12",  "Ward 22")
SC07 <-c("Ward 21", "Ward 103", "Ward 105", "Ward 112")
SC08 <-c("Ward 83", "Ward 85",  "Ward 86",  "Ward 100")
SC09 <-c("Ward 18", "Ward 87",  "Ward 89",  "Ward 90",  "Ward 91",  "Ward 116")
SC10 <-c("Ward 92", "Ward 93",  "Ward 94",  "Ward 97",  "Ward 98",  "Ward 99")
SC11 <-c("Ward 40", "Ward 44",  "Ward 46",  "Ward 47")
SC12 <-c("Ward 78", "Ward 79",  "Ward 81",  "Ward 82")
SC13 <-c("Ward 34", "Ward 35",  "Ward 36",  "Ward 80",  "Ward 88")
SC14 <-c("Ward 37", "Ward 38",  "Ward 39",  "Ward 41",  "Ward 42",  "Ward 45")
SC15 <-c("Ward 51", "Ward 52",  "Ward 53",  "Ward 55",  "Ward 56")
SC16 <-c("Ward 54", "Ward 57",  "Ward 74",  "Ward 77",  "Ward 115")
SC17 <-c("Ward 48", "Ward 49",  "Ward 60")
SC18 <-c("Ward 63", "Ward 65",  "Ward 66",  "Ward 67",  "Ward 68",  "Ward 110")
SC19 <-c("Ward 61", "Ward 64",  "Ward 69")
SC20 <-c("Ward 58", "Ward 59",  "Ward 62",  "Ward 71",  "Ward 72",  "Ward 73")
SC21 <-c("Ward 11", "Ward 19",  "Ward 108")
SC22 <-c("Ward 14", "Ward 16",  "Ward 17",  "Ward 114")
SC23 <-c("Ward 33", "Ward 43",  "Ward 75",  "Ward 76")
SC24 <-c("Ward 15", "Ward 84",  "Ward 95",  "Ward 96",  "Ward 109")
# Based on an interview with a local, we then assigned the 24 sub-council a grade corresponding to quality of life 

# vector for living quality

Best <-c(SC16, SC19, SC20)
Better <-c(SC01, SC08, SC15, SC17)
Good <-c(SC02, SC03, SC04, SC05, SC06, SC07, SC08, SC15, SC21, SC22, SC23, SC24)
Standard <-c(SC09, SC10, SC11, SC12, SC13, SC14, SC17, SC18)
# Next, we create two new variables: "sub_council", containing the sub council for each listing and "neighborhood_simplified", classing the location of the listing to either standard, good, better or best.
#assigning sub council

listings <- listings %>% 
  mutate(sub_council = case_when(neighbourhood_cleansed %in% SC01 ~ "SC01",
                                 neighbourhood_cleansed %in% SC02 ~ "SC02",
                                 neighbourhood_cleansed %in% SC03 ~ "SC03",
                                 neighbourhood_cleansed %in% SC04 ~ "SC04",
                                 neighbourhood_cleansed %in% SC05 ~ "SC05",
                                 neighbourhood_cleansed %in% SC06 ~ "SC06",
                                 neighbourhood_cleansed %in% SC07 ~ "SC07",
                                 neighbourhood_cleansed %in% SC08 ~ "SC08",
                                 neighbourhood_cleansed %in% SC09 ~ "SC09",
                                 neighbourhood_cleansed %in% SC10 ~ "SC10",
                                 neighbourhood_cleansed %in% SC11 ~ "SC11",
                                 neighbourhood_cleansed %in% SC12 ~ "SC12",
                                 neighbourhood_cleansed %in% SC13 ~ "SC13",
                                 neighbourhood_cleansed %in% SC14 ~ "SC14",
                                 neighbourhood_cleansed %in% SC15 ~ "SC15",
                                 neighbourhood_cleansed %in% SC16 ~ "SC16",
                                 neighbourhood_cleansed %in% SC17 ~ "SC17",
                                 neighbourhood_cleansed %in% SC18 ~ "SC18",
                                 neighbourhood_cleansed %in% SC19 ~ "SC19",
                                 neighbourhood_cleansed %in% SC20 ~ "SC20",
                                 neighbourhood_cleansed %in% SC21 ~ "SC21",
                                 neighbourhood_cleansed %in% SC22 ~ "SC22",
                                 neighbourhood_cleansed %in% SC23 ~ "SC23",
                                 neighbourhood_cleansed %in% SC24 ~ "SC24"
                                 ))
# Assigning neighbourhood simplified

listings <- listings %>% 
  mutate(neighbourhood_simplified = case_when(neighbourhood_cleansed %in% Best ~ "best",
                                              neighbourhood_cleansed %in% Better ~ "better",
                                              neighbourhood_cleansed %in% Good ~ "good",
                                              neighbourhood_cleansed %in% Standard ~ "standard"))

Now we will take a look at the minimum_nights variable and filter only to include listings where the minimum number of nights is less than or equal to 4.

#To check the 5 most common values for the variable `minimum_nights`
listings%>%
  count(minimum_nights)%>% 
  arrange(desc(n))%>%
  kable()%>%
  kable_styling()
minimum_nights n
2 7436
1 6894
3 3527
7 1614
5 1489
4 1098
10 508
14 337
20 252
6 244
30 182
8 62
90 59
28 50
12 47
15 36
21 35
60 33
180 27
9 17
13 12
31 11
25 9
100 8
120 7
29 6
365 6
11 5
18 3
27 3
45 3
93 3
150 3
360 3
16 2
19 2
22 2
24 2
32 2
95 2
168 2
183 2
185 2
210 2
26 1
33 1
40 1
42 1
89 1
91 1
94 1
124 1
182 1
186 1
240 1
300 1
1125 1
#To filter the variable `minimum_nights`
listings <- listings %>%
  filter(minimum_nights<= 4)

The top 5 most common values are for minimum nights are: 2, 1, 3, 7 and 5. The top values seem to be a week or less, which means most listings are probably used for tourism. However, some of the less common values also offer minimum stays of 30 or 365 days which demonstrates that some listings can be monthly or yearly rentals.

Finally we will inspect the accommodates variable. We want to filter out the 737 listings that can only accommodate one person, since we are interested in places where two people could rent.

#To check the 5 most common values for the variable accommodates
listings %>%
  count(accommodates) %>%
  arrange(desc(accommodates))%>%
  kable()%>%
  kable_styling()
accommodates n
40 1
20 1
18 1
16 81
15 5
14 50
13 13
12 130
11 13
10 294
9 61
8 712
7 214
6 1785
5 617
4 4466
3 1069
2 8705
1 737
#as we can see, the data set includes 737 listings for only 1 person
#Because we are only interested in offerings that can accommodate 2 people, we filter

#To filter the variable `accommodates`
listings <- listings %>%
  filter(accommodates >= 2)

2.5 Handling NAs

Some of our variables have missing values. Cleaning fee, for example, which we will use to calculate price for four nights, has 7049 missing values, as seen below. This is most likely because these listings do not charge a cleaning fee, so we will treat N/A values as 0.

listings%>%
skim(cleaning_fee)%>%
  kable()%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric cleaning_fee 4894 0.731 303 368 0 80 250 450 17347 ▇▁▁▁▁
listings <- listings %>%
  mutate(cleaning_fee = case_when(
    is.na(cleaning_fee) ~ 0, 
    TRUE ~ cleaning_fee
  ))

2.6 Creating our dependent variable: price_4_nights

listings <- listings%>%
  filter(guests_included<=2)%>% 
  mutate(
    extra_charge_pp= case_when(guests_included==2 ~ 0, guests_included ==1 ~ extra_people),
    price_4_nights= price*4 + extra_charge_pp*4 + cleaning_fee
  )

2.7 Computing summary statistics and visualizing the distribution of the variables of interest

Now, we will look at some of the variables that we want to examine when we eventually build our model. We want to look at how the variables are distributed, and whether there is any correlation between explanatory variables.

First, we will examine the host_experience_y variable that we created, which demonstrates how long a host has been listed on Airbnb.

#Descriptive Analysis

kable(favstats(~host_experience_y, data=listings))%>%
  kable_styling()
min Q1 median Q3 max mean sd n missing
0 2.53 4.16 5.34 11.7 4.04 2.06 16562 2
skim(listings$host_experience_y)%>%
  kable()%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric data 2 1 4.04 2.06 0 2.53 4.16 5.34 11.7 ▅▇▆▂▁
ggplot(listings, aes(x=host_experience_y)) +
  geom_density(fill = "lavender") + 
  labs(x = "Host Experience", y="Density", title= "Host Experience Density Plot") +
  theme_minimal()

Based on the density plot, it looks like host_experience_y is a multi-modal distribution which is slightly right skewed. The mean is 4.13 years and the median is 4.27.

Next, we will look at host_is_superhost. We might want to use this in our model to examine whether superhosts tend to charge a premium.

skim(listings$host_is_superhost)%>%
  kable()%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate logical.mean logical.count
logical data 2 1 0.25 FAL: 12417, TRU: 4145
listings %>%
  count(host_is_superhost)%>%
  kable()%>%
  kable_styling()
host_is_superhost n
FALSE 12417
TRUE 4145
NA 2
ggplot(listings, aes(x = host_is_superhost)) +
  geom_bar(width = 0.5, color = "blue", fill = "#77a5f0") +
  labs(x = "Super Host", y="Density", title= "Distribution of Superhosts") +
  theme_minimal() 

Based on the data above, roughly 1/4 of all hosts in Cape Town are super hosts.

Now, we look at number_of_reviews, which represents the total number of reviews on a particular listing.

listings%>%
  skim(number_of_reviews)%>%
  kable()%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric number_of_reviews 0 1 16.2 31 0 0 4 17 425 ▇▁▁▁▁
kable(favstats(~number_of_reviews, data=listings))%>%
  kable_styling()
min Q1 median Q3 max mean sd n missing
0 0 4 17 425 16.2 31 16564 0
listings%>%
  ggplot(aes(x=number_of_reviews))+
  geom_histogram(aes(x = number_of_reviews), color = "blue", fill = "#77a5f0") +
  labs(x = "Number of Reviews", y="Count", title= "Total Number of Reviews") +
  ggtitle("Total Number of Reviews") +
  coord_cartesian(xlim = c(0, 220)) +
  theme_minimal()

Based on the histogram, number of reviews is very skewed. It seems like the majority of listings do not have many reviews. The median is 3, yet the mean is 14.42, indicating that the variable is right-skewed.

Now we’ll look at the review_scores_rating variable, as we want to examine whether listings with higher reviews charge a premium.

  kable(favstats(~review_scores_rating, data=listings))%>%
  kable_styling()
min Q1 median Q3 max mean sd n missing
20 93 97 100 100 94.5 8.93 11959 4605

There are 7314 missing values for review_scores_rating, which most likely means that those listings did not receive a review. The median rating is 97 and the mean rating is 94.58. The standard deviation is 8.87.

Now let’s visualize the distribution:

listings%>%
  ggplot(aes(x=review_scores_rating))+
  geom_density(fill="lavender") +
  labs(x = "Review Score Rating") +
  ggtitle("Distribution of Review Score Rating.") +
  theme_minimal()

The distribution is left-skewed with most of the ratings in the 90’s. There is a small second peak at 80.

The following variables are reviews for individual categories. They are all discrete values from 0-10 and exhibit similar distributions. We will keep them all to explore whether some variables impact price more than others.

accuracy <- listings%>%
  ggplot(aes(x=review_scores_accuracy)) +
  geom_bar(width = 0.5, color = "blue", fill="#77a5f0") +
  labs(x = "Review Score Accuracy") +
  ggtitle("Accuracy") +
  theme_minimal()
cleanliness <- listings%>%
  ggplot(aes(x=review_scores_cleanliness))+
  geom_bar(width = 0.5, color = "blue", fill="#77a5f0") +
  labs(x = "Review Score Cleanliness") +
  ggtitle("Cleanliness") +
  theme_minimal()
communication <- listings%>%
  ggplot(aes(x=review_scores_communication))+
  geom_bar(width = 0.5, color = "blue", fill="#77a5f0") +
  labs(x = "Review Score Communication") +
  ggtitle("Communication") +
  theme_minimal()
check_in <- listings%>%
  ggplot(aes(x=review_scores_checkin))+
  geom_bar(width = 0.5, color = "blue", fill="#77a5f0") +
  labs(x = "Review Score Check-in") +
  ggtitle("Check-in") +
  theme_minimal()
location <- ggplot(listings, aes(x = review_scores_location)) +
  geom_bar(width = 0.5, color = "blue", fill = "#77a5f0") +
  labs(x = "Review Score Location") +
  ggtitle("Location") +
  theme_minimal() +
  NULL
value <- ggplot(listings, aes(x = review_scores_value)) +
  geom_bar(width = 0.5, color = "blue", fill = "#77a5f0") +
  labs(x = "Review Score Value") +
  ggtitle("Value") +
  theme_minimal() +
  NULL
library(gridExtra)
grid.arrange(accuracy, cleanliness, check_in, communication, location, value, nrow = 3, ncol = 3)

We also want to look at whether if a property has an instant booking option it impacts price. It looks like, based on the table below, properties are pretty evenly split between the two options.

listings %>%
skim(instant_bookable)%>%
  kable()%>%
  kable_styling()
skim_type skim_variable n_missing complete_rate logical.mean logical.count
logical instant_bookable 0 1 0.563 TRU: 9332, FAL: 7232
ggplot(listings, aes(x = instant_bookable)) +
  geom_bar(width = 0.5, color = "blue", fill = "#77a5f0") +
  labs(x = "Instant Booking") +
  ggtitle("Count of listings with or without Instant Booking") +
  theme_minimal() +
  NULL

We also want to explore the property_type variable.

#To check the top 4 categories of property type in Cape Town
listings %>%
count(property_type) %>%
arrange(desc(n))%>%
  kable()%>%
  kable_styling()
property_type n
Apartment 7256
House 3638
Guest suite 1314
Guesthouse 911
Bed and breakfast 569
Villa 484
Serviced apartment 407
Condominium 398
Cottage 350
Townhouse 318
Loft 262
Boutique hotel 158
Bungalow 90
Other 76
Hostel 66
Farm stay 50
Chalet 44
Tiny house 31
Aparthotel 29
Cabin 24
Hotel 24
Nature lodge 14
Camper/RV 9
Earth house 9
Boat 6
Resort 5
Dorm 4
Tipi 4
Hut 2
Lighthouse 2
Treehouse 2
Barn 1
Bus 1
Casa particular (Cuba) 1
Castle 1
Dome house 1
Heritage hotel (India) 1
Tent 1
Vacation home 1

As we can see, the four most common listings (Apartment, House, Guest Suite, and Villa) make up ~80% of all listings. Thus, to simplify, we will create a new variable with five categories: apartment, house, guest suite, villa, and other.

#To simplify the `property_type` variable into 5 categories, Mutate others to gather the other property type with small values
listings <- listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Apartment","House","Guest suite","Villa","Guest House","Bed and breakfast") ~ property_type, 
    TRUE ~ "Other"
  ))

listings %>% 
  ggplot(aes(x=prop_type_simplified)) +
  labs(x = "Property Type") +
  ggtitle("Simplified Property Type") +
  geom_bar(width = 0.5, color = "blue", fill = "#6dccc1") +
  theme_minimal()

Finally, we examine the distribution of listings based on quality of life in the area that they are located.

#count by neighborhood
listings %>%
  count(neighbourhood_simplified) %>%
  arrange(desc(n))%>%
  kable()%>%
  kable_styling()
neighbourhood_simplified n
best 10989
good 3000
better 2402
standard 173
ggplot(listings, aes(x = neighbourhood_simplified)) +
  geom_bar(width = 0.5, color = "blue", fill = "#77a5f0") +
  labs(x = "Neighborhood", y="Count", title="Distribution of listings based on cost of living") +
  theme_minimal()

The majority of the Airbnb listings seem to be located in Wards that are classified as “best” in regards to location. These are likely to be popular destinations for tourists.

2.8 Examining correlations between variables

Now we will take a look at some of the relationships between the explanatory variables that we might want to include in our model, to see if anything stands out.

# To see the correlations between variables
listings%>%
  select(bedrooms, bathrooms, accommodates, review_scores_rating, host_experience_y, price_4_nights) %>% 
  ggpairs(alpha=0.4) +
  theme_minimal()

Based on the correlation matrix above, it looks like the variables bathroom,bedrooms, and accommodates are highly correlated. We most likely will not want to include all of them in our model.

Now, we will look at the relationship between property type and price.

listings %>%
  ggplot(aes(x=prop_type_simplified, y= price_4_nights)) +
           geom_boxplot(aes(fill=prop_type_simplified), show.legend=FALSE)+
           theme_minimal()+
          labs(x="Property Type", title="Property Type and Price for Four Nights")+
          scale_y_continuous(name="Price for 4 nights", limits = c(0,60000))

Based on the plot, it seems like Villa and House might be indicators of a higher price. We will test them when we build our model.

We will also examine neighborhood in the plot below.

listings %>%
  ggplot(aes(x=neighbourhood_simplified, y= price_4_nights)) +
           geom_boxplot(aes(fill=neighbourhood_simplified), show.legend=FALSE)+
           theme_minimal()+
          labs(x="Neighbourood", title="Neighbourhood and Price for Four Nights")+
          scale_y_continuous(name="Price for 4 nights", limits = c(0,60000))

As expected, it seems like listings in the “best” neighbourhoods are most expensive, while those in “standard” are cheapest. We will test whether this effect is significant when we build our model.

We will also look at host_is_superhost.

listings %>%
  ggplot(aes(x=host_is_superhost, y= price_4_nights)) +
           geom_boxplot(aes(fill=host_is_superhost, show.legend=FALSE))+
           theme_minimal()+
          labs(x="Superhost", title="Superhost and Price for Four Nights")+
          scale_y_continuous(name="Price for 4 nights", limits = c(0,60000))

There does not seem to be a large distinction between whether a host is a superhost and the price of the listing, but we will confirm when we test our model.

We will also take a look at cancellation_policy.

listings %>%
  ggplot(aes(x=cancellation_policy, y= price_4_nights)) +
           geom_boxplot(aes(fill=cancellation_policy, show.legend=FALSE))+
           theme_minimal()+
          labs(x="Cancellation Policy", title="Cancellation Policy and Price for Four Nights")+
          scale_y_continuous(name="Price for 4 nights", limits = c(0,60000))

It seems like listings with stricter policies are slightly more expensive, which could be because they are more desirable listings. We will test the significance in our model fitting.

Finally, we will examine the air conditioning variable that we created.

listings %>%
  ggplot(aes(x=ac, y= price_4_nights)) +
           geom_boxplot(aes(fill=ac, show.legend=FALSE))+
           theme_minimal()+
          labs(x="Air conditioning", title="Air conditioning and Price for Four Nights")+
          scale_y_continuous(name="Price for 4 nights", limits = c(0,60000))

As expected, it seems like listings that have air conditioning are more expensive. We will examine the effect as we construct our model.

3 Mapping

Visualizing the map of listings in Cape Town

#creating the map
leaflet(data = filter(listings, minimum_nights <= 4,accommodates>=2)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

4 Regression Analysis

4.1 Visualizing the distribution of price_4_nights

#density plot price_4_nights
ggplot(data=listings, aes(x=price_4_nights)) +
  geom_density(fill = "lavender") +
  labs(x="log of Price for Four Nights") +
  theme_minimal()

#density plot log
ggplot(data=listings, aes(x=price_4_nights)) + 
  scale_x_log10()+geom_density(fill = "lavender") + 
  labs(x="log of Price for Four Nights") +
  theme_minimal()

Price_4_nights is very heavily positively skewed. Log (price_4_nights), on the other hand, is approximately normally distributed. Hence, log (price_4_nights) is a better candidate for further modeling.

4.1.1 Model 1: prop_type_simplified, number_of_reviews, review_scores_rating

Next, we fit a regression model with 3 explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

# ***** Fit linear regression models: First, just the mean
model0 <- lm(log (price_4_nights) ~ 1, data= listings)

model0 %>% 
  broom::tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)8.450.00611.39e+0308.448.47
model0 %>% 
  broom::glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
000.785-1.95e+043.9e+043.9e+041.02e+041656316564
# ***** Fit linear regression models: criminals on 3 explanatory variables.
model1 <- lm(log (price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating, data= listings)

model1 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 7.824 0.065 120.148 0.000 7.696 7.951
prop_type_simplifiedBed and breakfast -0.107 0.046 -2.319 0.020 -0.198 -0.017
prop_type_simplifiedGuest suite -0.422 0.022 -18.935 0.000 -0.465 -0.378
prop_type_simplifiedHouse 0.130 0.016 8.051 0.000 0.098 0.161
prop_type_simplifiedOther 0.000 0.017 -0.028 0.978 -0.033 0.033
prop_type_simplifiedVilla 1.233 0.038 32.763 0.000 1.160 1.307
number_of_reviews -0.001 0.000 -6.488 0.000 -0.002 -0.001
review_scores_rating 0.006 0.001 8.972 0.000 0.005 0.008
model1 %>% 
  broom::glance()%>%
 kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.131 0.13 0.667 257 0 7 -12118 24255 24321 5314 11951 11959

Below, we interpret the coefficients of our model: review_scores_rating For every 1 unit increase in review_scores_rating, the price_4_nights increases by 0.62% . As (exp(0.006188)-1)100 = 0.62% prop_type_simplified Price for 4 nights at a Bed and Breakfast is on average 10.2% lower than the price for an apartment. (Equivalent to (exp(-0.107467) - 1)100) = (-10.2%) Price for 4 nights at a Guest Suite is on average ** 34.4% lower than the price for an apartment. (Equivalent to (exp(-0.421721) - 1)100) = -34.4%) Price for 4 nights at a House is on average 13.8% higher than the price for an apartment. (Equivalent to (exp(0.129616) - 1)100) = 13.8%) Price for 4 nights at a Villa is on average 243.3% higher than the price for an apartment. (Equivalent to (exp(1.233319) - 1)*100)= 243.3%)

4.1.2 Model 2: adding room_type

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model.

# ***** Fit linear regression models: criminals on 4 explanatory variables.
model2 <- lm(log (price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type, data= listings)

model2 %>% 
  broom::tidy(conf.int=TRUE)%>%
   kable()%>%
  kable_styling()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 8.059 0.060 134.115 0.000 7.941 8.177
prop_type_simplifiedBed and breakfast 0.315 0.047 6.754 0.000 0.223 0.406
prop_type_simplifiedGuest suite -0.287 0.021 -13.894 0.000 -0.328 -0.247
prop_type_simplifiedHouse 0.369 0.016 23.534 0.000 0.338 0.399
prop_type_simplifiedOther 0.094 0.016 5.918 0.000 0.063 0.125
prop_type_simplifiedVilla 1.334 0.035 38.501 0.000 1.266 1.402
number_of_reviews -0.002 0.000 -9.458 0.000 -0.002 -0.001
review_scores_rating 0.004 0.001 6.790 0.000 0.003 0.006
room_typeHotel room 0.008 0.050 0.159 0.873 -0.091 0.107
room_typePrivate room -0.712 0.016 -45.918 0.000 -0.742 -0.682
room_typeShared room -1.046 0.109 -9.613 0.000 -1.259 -0.832
model2 %>% 
  broom::glance()%>%
   kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.267 0.266 0.613 434 0 10 -11103 22231 22320 4484 11948 11959

All else equal, room_typePrivate room and room_typeShared room are significant predictors of the cost for 4 night as it has T-statistics >2 and p-vale <0.05.

room_typeHotel room does not seem to be significantly different from the baseline room_type variable (“Entire home/apt”) as its |T statistic| < 2.

Now we will explore adding further variables.

4.1.3 Model 3: Improving model

We start our own exploration now. We first wonder if we can improve model 2.

model3 <- lm(log (price_4_nights) ~ prop_type_simplified + review_scores_rating + room_type, data= listings)

model3 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 8.077 0.060 133.989 0.000 7.959 8.195
prop_type_simplifiedBed and breakfast 0.329 0.047 7.034 0.000 0.237 0.420
prop_type_simplifiedGuest suite -0.295 0.021 -14.221 0.000 -0.335 -0.254
prop_type_simplifiedHouse 0.379 0.016 24.183 0.000 0.348 0.410
prop_type_simplifiedOther 0.092 0.016 5.795 0.000 0.061 0.123
prop_type_simplifiedVilla 1.347 0.035 38.767 0.000 1.279 1.416
review_scores_rating 0.004 0.001 5.855 0.000 0.002 0.005
room_typeHotel room 0.025 0.051 0.504 0.614 -0.074 0.125
room_typePrivate room -0.704 0.016 -45.310 0.000 -0.735 -0.674
room_typeShared room -1.023 0.109 -9.375 0.000 -1.237 -0.809
model3 %>% 
  broom::glance()%>%
  kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.261 0.26 0.615 469 0 9 -11148 22318 22399 4518 11949 11959
msummary(model3)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            8.077277   0.060283  133.99  < 2e-16 ***
## prop_type_simplifiedBed and breakfast  0.328617   0.046721    7.03  2.1e-12 ***
## prop_type_simplifiedGuest suite       -0.294813   0.020731  -14.22  < 2e-16 ***
## prop_type_simplifiedHouse              0.379143   0.015678   24.18  < 2e-16 ***
## prop_type_simplifiedOther              0.092156   0.015904    5.79  7.0e-09 ***
## prop_type_simplifiedVilla              1.347416   0.034756   38.77  < 2e-16 ***
## review_scores_rating                   0.003716   0.000635    5.86  4.9e-09 ***
## room_typeHotel room                    0.025498   0.050574    0.50     0.61    
## room_typePrivate room                 -0.704154   0.015541  -45.31  < 2e-16 ***
## room_typeShared room                  -1.023137   0.109138   -9.37  < 2e-16 ***
## 
## Residual standard error: 0.615 on 11949 degrees of freedom
##   (4605 observations deleted due to missingness)
## Multiple R-squared:  0.261,  Adjusted R-squared:  0.26 
## F-statistic:  469 on 9 and 11949 DF,  p-value: <2e-16
car::vif(model3)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.37  5            1.03
## review_scores_rating 1.02  1            1.01
## room_type            1.36  3            1.05
autoplot(model3)

After taking number_of_reviews away, we find that there is almost no change in adjusted R-squared. Also, there is no change in collinearity and significance after eliminating number_of_reviews. So, we can get rid of number_of_reviews in our model.

4.1.4 Model 4: Adding bathrooms, bedrooms, beds, accommodates

Now, let’s take a look at the number of bathrooms, bedrooms, beds and the number of people they can accommodates. We want to know if they are significant in predicting price of 4 nights in Cape Town.

model4 <- lm(log (price_4_nights) ~ prop_type_simplified + 
               review_scores_rating + room_type + bathrooms + bedrooms + beds + 
               accommodates, data= listings)

model4 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 7.443 0.053 140.283 0.000 7.339 7.547
prop_type_simplifiedBed and breakfast 0.082 0.040 2.030 0.042 0.003 0.161
prop_type_simplifiedGuest suite -0.236 0.018 -13.291 0.000 -0.271 -0.201
prop_type_simplifiedHouse 0.016 0.015 1.088 0.277 -0.013 0.044
prop_type_simplifiedOther 0.007 0.014 0.484 0.629 -0.020 0.033
prop_type_simplifiedVilla 0.450 0.033 13.790 0.000 0.386 0.514
review_scores_rating 0.004 0.001 7.334 0.000 0.003 0.005
room_typeHotel room 0.143 0.043 3.318 0.001 0.059 0.228
room_typePrivate room -0.384 0.014 -26.812 0.000 -0.412 -0.356
room_typeShared room -0.764 0.095 -8.086 0.000 -0.950 -0.579
bathrooms 0.126 0.008 15.452 0.000 0.110 0.142
bedrooms 0.068 0.009 7.243 0.000 0.050 0.086
beds -0.068 0.005 -14.340 0.000 -0.078 -0.059
accommodates 0.146 0.005 31.265 0.000 0.137 0.155
model4 %>% 
  broom::glance()%>%
  kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.462 0.462 0.525 787 0 13 -9213 18456 18567 3275 11903 11917
msummary(model4)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            7.442641   0.053054  140.28  < 2e-16 ***
## prop_type_simplifiedBed and breakfast  0.081656   0.040226    2.03  0.04239 *  
## prop_type_simplifiedGuest suite       -0.235741   0.017737  -13.29  < 2e-16 ***
## prop_type_simplifiedHouse              0.015862   0.014580    1.09  0.27667    
## prop_type_simplifiedOther              0.006603   0.013649    0.48  0.62852    
## prop_type_simplifiedVilla              0.449898   0.032626   13.79  < 2e-16 ***
## review_scores_rating                   0.004017   0.000548    7.33  2.4e-13 ***
## room_typeHotel room                    0.143330   0.043196    3.32  0.00091 ***
## room_typePrivate room                 -0.383781   0.014314  -26.81  < 2e-16 ***
## room_typeShared room                  -0.764245   0.094510   -8.09  6.7e-16 ***
## bathrooms                              0.126130   0.008163   15.45  < 2e-16 ***
## bedrooms                               0.067896   0.009374    7.24  4.6e-13 ***
## beds                                  -0.068429   0.004772  -14.34  < 2e-16 ***
## accommodates                           0.145776   0.004663   31.27  < 2e-16 ***
## 
## Residual standard error: 0.525 on 11903 degrees of freedom
##   (4647 observations deleted due to missingness)
## Multiple R-squared:  0.462,  Adjusted R-squared:  0.462 
## F-statistic:  787 on 13 and 11903 DF,  p-value: <2e-16
car::vif(model4)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.93  5            1.07
## review_scores_rating 1.02  1            1.01
## room_type            1.62  3            1.08
## bathrooms            3.20  1            1.79
## bedrooms             5.43  1            2.33
## beds                 2.97  1            1.72
## accommodates         3.87  1            1.97
autoplot(model4)

Bathrooms, bedrooms, beds and accommodates are all significant predictors of the cost for 4 night as they all have T-statistics >2 and p-vale <0.05. That is to say, the more bathrooms/bedrooms/beds it has or the more people it can accommodates, the more expensive it is.

However, as we mentioned above, the four variables mentioned above are highly correlated with correlation of around 0.7. Considering about collinearity, we can only keep one of those variables. Noticing that accommodates has the biggest T-statistics, we decide to keep it.

4.1.5 Model 5: Adding Air Conditioning

Since as stated earlier only 6% of the households in South Africa have air conditioning, we examine how significant it is in predicting the price (Statista, 2020).

model5 <- lm(log (price_4_nights) ~ prop_type_simplified + review_scores_rating + room_type + accommodates + ac, data= listings)

model5 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 7.445 0.052 143.643 0.000 7.344 7.547
prop_type_simplifiedBed and breakfast 0.131 0.040 3.318 0.001 0.054 0.209
prop_type_simplifiedGuest suite -0.210 0.018 -11.965 0.000 -0.244 -0.176
prop_type_simplifiedHouse 0.079 0.014 5.499 0.000 0.051 0.107
prop_type_simplifiedOther 0.005 0.013 0.379 0.705 -0.021 0.032
prop_type_simplifiedVilla 0.546 0.032 17.173 0.000 0.484 0.609
review_scores_rating 0.004 0.001 6.915 0.000 0.003 0.005
room_typeHotel room 0.090 0.043 2.101 0.036 0.006 0.174
room_typePrivate room -0.362 0.014 -25.605 0.000 -0.389 -0.334
room_typeShared room -1.073 0.092 -11.651 0.000 -1.253 -0.892
accommodates 0.167 0.003 59.584 0.000 0.161 0.172
acTRUE 0.326 0.011 30.214 0.000 0.305 0.347
model5 %>% 
  broom::glance()%>%
  kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.474 0.474 0.519 979 0 11 -9115 18256 18352 3215 11947 11959
msummary(model5)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            7.445308   0.051832  143.64  < 2e-16 ***
## prop_type_simplifiedBed and breakfast  0.131212   0.039540    3.32  0.00091 ***
## prop_type_simplifiedGuest suite       -0.210012   0.017552  -11.97  < 2e-16 ***
## prop_type_simplifiedHouse              0.078885   0.014347    5.50  3.9e-08 ***
## prop_type_simplifiedOther              0.005106   0.013484    0.38  0.70496    
## prop_type_simplifiedVilla              0.546296   0.031812   17.17  < 2e-16 ***
## review_scores_rating                   0.003707   0.000536    6.91  4.9e-12 ***
## room_typeHotel room                    0.089831   0.042765    2.10  0.03570 *  
## room_typePrivate room                 -0.361531   0.014120  -25.60  < 2e-16 ***
## room_typeShared room                  -1.072942   0.092089  -11.65  < 2e-16 ***
## accommodates                           0.166640   0.002797   59.58  < 2e-16 ***
## acTRUE                                 0.325647   0.010778   30.21  < 2e-16 ***
## 
## Residual standard error: 0.519 on 11947 degrees of freedom
##   (4605 observations deleted due to missingness)
## Multiple R-squared:  0.474,  Adjusted R-squared:  0.474 
## F-statistic:  979 on 11 and 11947 DF,  p-value: <2e-16
car::vif(model5)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.85  5            1.06
## review_scores_rating 1.02  1            1.01
## room_type            1.58  3            1.08
## accommodates         1.43  1            1.19
## ac                   1.04  1            1.02
autoplot(model5)

As we supposed, A/C is a significant predictor of the cost for 4 night as it has T-statistics >2 and p-vale <0.05! As we predict, if there is air conditioning in the property, it will be more expensive to rent.

Also, all variables’ VIFs are smaller than 5, indicating that there is no collinearity. So we can accept A/C in our model.

4.1.6 Model 6: Adding host_is_superhost

After discussing about the property itself, we also would like to see if the host is good. We think that superhosts would charge us more since they believe their properties are popular in the market.

model6 <- lm(log (price_4_nights) ~ prop_type_simplified + review_scores_rating + room_type + accommodates + ac + host_is_superhost, data= listings)

model6 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 7.418 0.052 141.574 0.000 7.315 7.521
prop_type_simplifiedBed and breakfast 0.130 0.040 3.285 0.001 0.052 0.207
prop_type_simplifiedGuest suite -0.204 0.018 -11.570 0.000 -0.239 -0.169
prop_type_simplifiedHouse 0.079 0.014 5.486 0.000 0.051 0.107
prop_type_simplifiedOther 0.007 0.013 0.498 0.619 -0.020 0.033
prop_type_simplifiedVilla 0.551 0.032 17.302 0.000 0.488 0.613
review_scores_rating 0.004 0.001 7.514 0.000 0.003 0.005
room_typeHotel room 0.088 0.043 2.055 0.040 0.004 0.172
room_typePrivate room -0.365 0.014 -25.790 0.000 -0.392 -0.337
room_typeShared room -1.076 0.092 -11.689 0.000 -1.256 -0.896
accommodates 0.166 0.003 59.377 0.000 0.161 0.172
acTRUE 0.328 0.011 30.405 0.000 0.307 0.350
host_is_superhostTRUE -0.037 0.011 -3.477 0.001 -0.057 -0.016
model6 %>% 
  broom::glance()%>%
  kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.475 0.474 0.519 899 0 12 -9109 18246 18349 3212 11946 11959
msummary(model6)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            7.418068   0.052397  141.57  < 2e-16 ***
## prop_type_simplifiedBed and breakfast  0.129826   0.039524    3.28  0.00102 ** 
## prop_type_simplifiedGuest suite       -0.203970   0.017629  -11.57  < 2e-16 ***
## prop_type_simplifiedHouse              0.078666   0.014340    5.49  4.2e-08 ***
## prop_type_simplifiedOther              0.006716   0.013486    0.50  0.61850    
## prop_type_simplifiedVilla              0.550570   0.031821   17.30  < 2e-16 ***
## review_scores_rating                   0.004128   0.000549    7.51  6.2e-14 ***
## room_typeHotel room                    0.087865   0.042749    2.06  0.03986 *  
## room_typePrivate room                 -0.364763   0.014144  -25.79  < 2e-16 ***
## room_typeShared room                  -1.075955   0.092051  -11.69  < 2e-16 ***
## accommodates                           0.166175   0.002799   59.38  < 2e-16 ***
## acTRUE                                 0.328491   0.010804   30.40  < 2e-16 ***
## host_is_superhostTRUE                 -0.036640   0.010538   -3.48  0.00051 ***
## 
## Residual standard error: 0.519 on 11946 degrees of freedom
##   (4605 observations deleted due to missingness)
## Multiple R-squared:  0.475,  Adjusted R-squared:  0.474 
## F-statistic:  899 on 12 and 11946 DF,  p-value: <2e-16
car::vif(model6)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.87  5            1.06
## review_scores_rating 1.07  1            1.03
## room_type            1.59  3            1.08
## accommodates         1.43  1            1.20
## ac                   1.05  1            1.02
## host_is_superhost    1.09  1            1.04
autoplot(model6)

Unfortunately, we are wrong. Superhosts don’t charge us more! We can see that is_super_host is NOT a significant predictor of the cost for 4 night as it has T-statistics <2 and p-vale >0.05.

4.1.7 Model 7: Adding cancellation_policy

We also wonder would strict cancellation policy be an indicator of the price. Intuitively, the higher the price, the more the host would lose if guests canceled. Is that true for property in Cape Town?

model7 <- lm(log (price_4_nights) ~ prop_type_simplified + review_scores_rating + room_type + accommodates + ac + cancellation_policy, data= listings)

model7 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 7.383 0.052 143.03 0.000 7.282 7.485
prop_type_simplifiedBed and breakfast 0.128 0.039 3.27 0.001 0.052 0.205
prop_type_simplifiedGuest suite -0.198 0.017 -11.38 0.000 -0.232 -0.164
prop_type_simplifiedHouse 0.079 0.014 5.58 0.000 0.051 0.107
prop_type_simplifiedOther 0.003 0.013 0.25 0.803 -0.023 0.030
prop_type_simplifiedVilla 0.533 0.032 16.92 0.000 0.471 0.595
review_scores_rating 0.004 0.001 6.83 0.000 0.003 0.005
room_typeHotel room 0.087 0.042 2.06 0.040 0.004 0.170
room_typePrivate room -0.350 0.014 -24.91 0.000 -0.377 -0.322
room_typeShared room -1.038 0.091 -11.38 0.000 -1.217 -0.859
accommodates 0.163 0.003 58.74 0.000 0.158 0.169
acTRUE 0.320 0.011 29.95 0.000 0.299 0.341
cancellation_policymoderate 0.033 0.012 2.63 0.008 0.008 0.057
cancellation_policystrict 0.166 0.012 14.28 0.000 0.143 0.189
model7 %>% 
  broom::glance()%>%
  kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.485 0.484 0.514 864 0 13 -8993 18015 18126 3150 11945 11959
msummary(model7)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            7.383479   0.051621  143.03  < 2e-16 ***
## prop_type_simplifiedBed and breakfast  0.128266   0.039163    3.28   0.0011 ** 
## prop_type_simplifiedGuest suite       -0.197971   0.017392  -11.38  < 2e-16 ***
## prop_type_simplifiedHouse              0.079175   0.014203    5.57  2.5e-08 ***
## prop_type_simplifiedOther              0.003336   0.013348    0.25   0.8027    
## prop_type_simplifiedVilla              0.532964   0.031505   16.92  < 2e-16 ***
## review_scores_rating                   0.003627   0.000531    6.83  9.1e-12 ***
## room_typeHotel room                    0.087091   0.042352    2.06   0.0398 *  
## room_typePrivate room                 -0.349656   0.014036  -24.91  < 2e-16 ***
## room_typeShared room                  -1.037813   0.091191  -11.38  < 2e-16 ***
## accommodates                           0.163234   0.002779   58.74  < 2e-16 ***
## acTRUE                                 0.319853   0.010679   29.95  < 2e-16 ***
## cancellation_policymoderate            0.032820   0.012468    2.63   0.0085 ** 
## cancellation_policystrict              0.166290   0.011643   14.28  < 2e-16 ***
## 
## Residual standard error: 0.514 on 11945 degrees of freedom
##   (4605 observations deleted due to missingness)
## Multiple R-squared:  0.485,  Adjusted R-squared:  0.484 
## F-statistic:  864 on 13 and 11945 DF,  p-value: <2e-16
car::vif(model7)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.86  5            1.06
## review_scores_rating 1.02  1            1.01
## room_type            1.60  3            1.08
## accommodates         1.44  1            1.20
## ac                   1.04  1            1.02
## cancellation_policy  1.04  2            1.01
autoplot(model7)

We can see that cancellation_policy is a significant predictor of the cost for 4 night as it has T-statistics <2 and p-vale >0.05. It is true that host will impose strict cancellation policy if the price of his property is high.

Also, all variables’ VIFs are smaller than 5, indicating that there is no collinearity. So we can accept cancellation_policy in our model.

As mentioned previously, we have two ways to simplify neighbourhood information. Here, we compare two methods (Method 1 in model 8, Method 2 in model 9) and select the best to add to our model.

4.1.8 Model 8: Adding neighbourhood information

model8 <- lm(log (price_4_nights) ~ prop_type_simplified + review_scores_rating + room_type + accommodates + ac + cancellation_policy + neighbourhood_simplified, data= listings)

model8 %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 7.529 0.051 148.910 0.000 7.430 7.628
prop_type_simplifiedBed and breakfast 0.166 0.038 4.334 0.000 0.091 0.240
prop_type_simplifiedGuest suite -0.150 0.017 -8.785 0.000 -0.183 -0.116
prop_type_simplifiedHouse 0.091 0.014 6.567 0.000 0.064 0.118
prop_type_simplifiedOther 0.010 0.013 0.798 0.425 -0.015 0.036
prop_type_simplifiedVilla 0.491 0.031 15.970 0.000 0.431 0.551
review_scores_rating 0.003 0.001 5.427 0.000 0.002 0.004
room_typeHotel room 0.080 0.041 1.945 0.052 -0.001 0.161
room_typePrivate room -0.313 0.014 -22.818 0.000 -0.340 -0.286
room_typeShared room -1.084 0.089 -12.214 0.000 -1.258 -0.910
accommodates 0.167 0.003 61.687 0.000 0.162 0.173
acTRUE 0.305 0.010 29.197 0.000 0.285 0.326
cancellation_policymoderate 0.012 0.012 0.994 0.320 -0.012 0.036
cancellation_policystrict 0.137 0.011 12.057 0.000 0.115 0.160
neighbourhood_simplifiedbetter -0.183 0.014 -13.270 0.000 -0.210 -0.156
neighbourhood_simplifiedgood -0.308 0.013 -23.148 0.000 -0.334 -0.282
neighbourhood_simplifiedstandard -0.523 0.056 -9.250 0.000 -0.633 -0.412
model8 %>% 
  broom::glance()%>%
  kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.512 0.512 0.5 785 0 16 -8661 17359 17492 2981 11942 11959
msummary(model8)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            7.528936   0.050560  148.91  < 2e-16 ***
## prop_type_simplifiedBed and breakfast  0.165522   0.038188    4.33  1.5e-05 ***
## prop_type_simplifiedGuest suite       -0.149805   0.017053   -8.78  < 2e-16 ***
## prop_type_simplifiedHouse              0.091091   0.013870    6.57  5.3e-11 ***
## prop_type_simplifiedOther              0.010381   0.013009    0.80    0.425    
## prop_type_simplifiedVilla              0.490780   0.030731   15.97  < 2e-16 ***
## review_scores_rating                   0.002810   0.000518    5.43  5.8e-08 ***
## room_typeHotel room                    0.080174   0.041220    1.95    0.052 .  
## room_typePrivate room                 -0.313203   0.013726  -22.82  < 2e-16 ***
## room_typeShared room                  -1.083994   0.088748  -12.21  < 2e-16 ***
## accommodates                           0.167380   0.002713   61.69  < 2e-16 ***
## acTRUE                                 0.305286   0.010456   29.20  < 2e-16 ***
## cancellation_policymoderate            0.012087   0.012156    0.99    0.320    
## cancellation_policystrict              0.137232   0.011382   12.06  < 2e-16 ***
## neighbourhood_simplifiedbetter        -0.183127   0.013800  -13.27  < 2e-16 ***
## neighbourhood_simplifiedgood          -0.308200   0.013315  -23.15  < 2e-16 ***
## neighbourhood_simplifiedstandard      -0.522629   0.056499   -9.25  < 2e-16 ***
## 
## Residual standard error: 0.5 on 11942 degrees of freedom
##   (4605 observations deleted due to missingness)
## Multiple R-squared:  0.512,  Adjusted R-squared:  0.512 
## F-statistic:  785 on 16 and 11942 DF,  p-value: <2e-16
car::vif(model8)
##                          GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified     1.91  5            1.07
## review_scores_rating     1.02  1            1.01
## room_type                1.62  3            1.08
## accommodates             1.45  1            1.20
## ac                       1.06  1            1.03
## cancellation_policy      1.05  2            1.01
## neighbourhood_simplified 1.09  3            1.01
autoplot(model8)

5 Final Model

So, here is our final model. It includes the following explanatory variables: prop_type_simplified, review_scores_rating, room_type, accommodates, ac, cancellation_policy, and neighbourhood_simplified.

model_final <- lm(log (price_4_nights) ~ prop_type_simplified + 
                    review_scores_rating +  room_type + accommodates + ac + 
                    cancellation_policy + neighbourhood_simplified, data= listings)

model_final %>% 
  broom::tidy(conf.int=TRUE)%>%
  kable()%>%
  kable_styling()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 7.529 0.051 148.910 0.000 7.430 7.628
prop_type_simplifiedBed and breakfast 0.166 0.038 4.334 0.000 0.091 0.240
prop_type_simplifiedGuest suite -0.150 0.017 -8.785 0.000 -0.183 -0.116
prop_type_simplifiedHouse 0.091 0.014 6.567 0.000 0.064 0.118
prop_type_simplifiedOther 0.010 0.013 0.798 0.425 -0.015 0.036
prop_type_simplifiedVilla 0.491 0.031 15.970 0.000 0.431 0.551
review_scores_rating 0.003 0.001 5.427 0.000 0.002 0.004
room_typeHotel room 0.080 0.041 1.945 0.052 -0.001 0.161
room_typePrivate room -0.313 0.014 -22.818 0.000 -0.340 -0.286
room_typeShared room -1.084 0.089 -12.214 0.000 -1.258 -0.910
accommodates 0.167 0.003 61.687 0.000 0.162 0.173
acTRUE 0.305 0.010 29.197 0.000 0.285 0.326
cancellation_policymoderate 0.012 0.012 0.994 0.320 -0.012 0.036
cancellation_policystrict 0.137 0.011 12.057 0.000 0.115 0.160
neighbourhood_simplifiedbetter -0.183 0.014 -13.270 0.000 -0.210 -0.156
neighbourhood_simplifiedgood -0.308 0.013 -23.148 0.000 -0.334 -0.282
neighbourhood_simplifiedstandard -0.523 0.056 -9.250 0.000 -0.633 -0.412
model_final %>% 
  broom::glance()%>%
  kable()%>%
  kable_styling()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.512 0.512 0.5 785 0 16 -8661 17359 17492 2981 11942 11959
msummary(model_final)
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            7.528936   0.050560  148.91  < 2e-16 ***
## prop_type_simplifiedBed and breakfast  0.165522   0.038188    4.33  1.5e-05 ***
## prop_type_simplifiedGuest suite       -0.149805   0.017053   -8.78  < 2e-16 ***
## prop_type_simplifiedHouse              0.091091   0.013870    6.57  5.3e-11 ***
## prop_type_simplifiedOther              0.010381   0.013009    0.80    0.425    
## prop_type_simplifiedVilla              0.490780   0.030731   15.97  < 2e-16 ***
## review_scores_rating                   0.002810   0.000518    5.43  5.8e-08 ***
## room_typeHotel room                    0.080174   0.041220    1.95    0.052 .  
## room_typePrivate room                 -0.313203   0.013726  -22.82  < 2e-16 ***
## room_typeShared room                  -1.083994   0.088748  -12.21  < 2e-16 ***
## accommodates                           0.167380   0.002713   61.69  < 2e-16 ***
## acTRUE                                 0.305286   0.010456   29.20  < 2e-16 ***
## cancellation_policymoderate            0.012087   0.012156    0.99    0.320    
## cancellation_policystrict              0.137232   0.011382   12.06  < 2e-16 ***
## neighbourhood_simplifiedbetter        -0.183127   0.013800  -13.27  < 2e-16 ***
## neighbourhood_simplifiedgood          -0.308200   0.013315  -23.15  < 2e-16 ***
## neighbourhood_simplifiedstandard      -0.522629   0.056499   -9.25  < 2e-16 ***
## 
## Residual standard error: 0.5 on 11942 degrees of freedom
##   (4605 observations deleted due to missingness)
## Multiple R-squared:  0.512,  Adjusted R-squared:  0.512 
## F-statistic:  785 on 16 and 11942 DF,  p-value: <2e-16
#To check the regression table, collinearity and residual
car::vif(model_final)
##                          GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified     1.91  5            1.07
## review_scores_rating     1.02  1            1.01
## room_type                1.62  3            1.08
## accommodates             1.45  1            1.20
## ac                       1.06  1            1.03
## cancellation_policy      1.05  2            1.01
## neighbourhood_simplified 1.09  3            1.01
autoplot(model_final)

Below, we interpret the final model coefficients:

cancellation_policy For every 1 unit increase in review_scores_rating, there is an increase of 0.28% in price_4_nights (Equivalent to (exp(0.00281) - 1)*100)

property_type Price for 4 nights at a Bed and Breakfast is on average 18.0% higher than the price for an apartment. (Equivalent to (exp(0.16552) - 1)*100) = (18.0%) Price for 4 nights at a Guest Suite is on average 13.9% cheaper than the price for an apartment. (Equivalent to (exp(-0.14981) - 1)100) = -13.9%) Price for 4 nights at a House is on average 9.5% higher than the price for an apartment. (Equivalent to (exp(0.09109) - 1)100) = 9.5%) Price for 4 nights at a Villa is on average 63.4% higher** than the price for an apartment. (Equivalent to (exp(0.49078) - 1)*100)= 63.4%)

cancellation_policy The price of a property with a moderate cancellation policy does not differ significantly from the price for a property with a flexible cancellation policy. The price of a property with a strict cancellation policy is on average 14.7% higher than the price for a property with a flexible cancellation policy. (Equivalent to (exp(0.13723) - 1)*100)= 14.7%)

room_type The price of a Hotel room is on average 8.3% higher than the price for an Entire home or apartment. (Equivalent to (exp(0.08017) - 1)100)= 8.3%). However, the price is significant only at a 10% significance level. The price of a Private room is on average 26.8% lower than the price for an Entire home or apartment. (Equivalent to (exp(-0.31320) - 1)100)= -26.8%) The price of a Shared room is on average 66.2% lower than the price for an Entire home or apartment. (Equivalent to (exp(-1.08399) - 1)*100)= -66.2%)

accomodates For each additional person a property accommodates, the average increase in price for 4 nights is 18.2%. (Equivalent to (exp(0.16738) - 1)*100)= 18.2%)

neighbourhood_simplified The price of a property in a better neighbourhood is on average 16.7% lower than the price of a property in a best neighbourhood. (Equivalent to (exp(-0.18313) - 1)100)= -16.7%) The price of a property in a good neighbourhood is on average 26.5% lower than the price of a property in a best neighbourhood. (Equivalent to (exp(-0.30820) - 1)100)= -26.5%) The price of a property in a standard neighbourhood is on average 40.7% lower than the price of a property in a best neighbourhood. (Equivalent to (exp(-0.52263) - 1)*100)= -40.7%)

ac The price of a property with an ac is on average 35.7% higher than the price of a property without one. (Equivalent to (exp(0.30529) - 1)*100)= 35.7%)

5.1 Summary table of final model and rationale

#creating comparison table
huxreg(model1,model2, model3, model4, model5, model6, model7, model_final,
       statistics = c('#observations' = 'nobs',
                      'R squared' = 'r.squared',
                      'Adj. R Squared' = 'adj.r.squared',
                      'Residual SE' = 'sigma'),
       bold_signif = 0.05,
       stars = NULL
) %>%
  set_caption('Comparison of models')
Comparison of models
(1)(2)(3)(4)(5)(6)(7)(8)
(Intercept)7.824 8.059 8.077 7.443 7.445 7.418 7.383 7.529 
(0.065)(0.060)(0.060)(0.053)(0.052)(0.052)(0.052)(0.051)
prop_type_simplifiedBed and breakfast-0.107 0.315 0.329 0.082 0.131 0.130 0.128 0.166 
(0.046)(0.047)(0.047)(0.040)(0.040)(0.040)(0.039)(0.038)
prop_type_simplifiedGuest suite-0.422 -0.287 -0.295 -0.236 -0.210 -0.204 -0.198 -0.150 
(0.022)(0.021)(0.021)(0.018)(0.018)(0.018)(0.017)(0.017)
prop_type_simplifiedHouse0.130 0.369 0.379 0.016 0.079 0.079 0.079 0.091 
(0.016)(0.016)(0.016)(0.015)(0.014)(0.014)(0.014)(0.014)
prop_type_simplifiedOther-0.000 0.094 0.092 0.007 0.005 0.007 0.003 0.010 
(0.017)(0.016)(0.016)(0.014)(0.013)(0.013)(0.013)(0.013)
prop_type_simplifiedVilla1.233 1.334 1.347 0.450 0.546 0.551 0.533 0.491 
(0.038)(0.035)(0.035)(0.033)(0.032)(0.032)(0.032)(0.031)
number_of_reviews-0.001 -0.002                               
(0.000)(0.000)                              
review_scores_rating0.006 0.004 0.004 0.004 0.004 0.004 0.004 0.003 
(0.001)(0.001)(0.001)(0.001)(0.001)(0.001)(0.001)(0.001)
room_typeHotel room     0.008 0.025 0.143 0.090 0.088 0.087 0.080 
     (0.050)(0.051)(0.043)(0.043)(0.043)(0.042)(0.041)
room_typePrivate room     -0.712 -0.704 -0.384 -0.362 -0.365 -0.350 -0.313 
     (0.016)(0.016)(0.014)(0.014)(0.014)(0.014)(0.014)
room_typeShared room     -1.046 -1.023 -0.764 -1.073 -1.076 -1.038 -1.084 
     (0.109)(0.109)(0.095)(0.092)(0.092)(0.091)(0.089)
bathrooms               0.126                     
               (0.008)                    
bedrooms               0.068                     
               (0.009)                    
beds               -0.068                     
               (0.005)                    
accommodates               0.146 0.167 0.166 0.163 0.167 
               (0.005)(0.003)(0.003)(0.003)(0.003)
acTRUE                    0.326 0.328 0.320 0.305 
                    (0.011)(0.011)(0.011)(0.010)
host_is_superhostTRUE                         -0.037           
                         (0.011)          
cancellation_policymoderate                              0.033 0.012 
                              (0.012)(0.012)
cancellation_policystrict                              0.166 0.137 
                              (0.012)(0.011)
neighbourhood_simplifiedbetter                                   -0.183 
                                   (0.014)
neighbourhood_simplifiedgood                                   -0.308 
                                   (0.013)
neighbourhood_simplifiedstandard                                   -0.523 
                                   (0.056)
#observations11959     11959     11959     11917     11959     11959     11959     11959     
R squared0.131 0.267 0.261 0.462 0.474 0.475 0.485 0.512 
Adj. R Squared0.130 0.266 0.260 0.462 0.474 0.474 0.484 0.512 
Residual SE0.667 0.613 0.615 0.525 0.519 0.519 0.514 0.500 

This table shows all of the coefficients and R squared of the models that we fit. Our final model has the highest R2 out of the models that we fit.

5.2 Final model diagnostics

We can see from “Residual Vs Fitted” that residuals are random, with no pattern and around Y=0, indicating that the the regression follows the linearity assumption.

In “Normal Q-Q” plot, we notice that residuals don’t deviate from the straight line, so we can confirm that our regression follows the normality assumption.

Checking on “Scale-Location” plot, we find that residuals are stable at around 1 when fitted values are around 7 to 10. It is acceptable to say that our regression satisfies the Equal Variance assumption in general.

Lastly, we look at “Residuals Vs Leverage” plot. We can clearly see that leverage is smaller than 0.5, indicating that most data points don’t have unusual values of predictors.

5.3 Final model Conclusion

Based on Airbnb data, the best predictors of the price for 2 persons for 4 nights are: property type, room type, review ratings scores, how strict the cancellation policy is, how many people the property accommodates, which neighbourhood the property is in, and if the property has air conditioning installed.

We find that the price of booking a Villa is the highest, whereas the price of booking a Guest suite is the lowest. Additionally, we find it interesting that the price of a Bed and Breakfast is on average higher than the price of an Apartment. Furthermore, it seems that the stricter the cancellation policy the higher the price. In addition, the model shows that the price varies significantly based on which neighbourhood the property is located in. Better neighbourhoods have a higher price on average, and so do the properties that accommodate more people. To conclude, since the average temperatures in Cape Town are high, having Air conditioning installed is associated with an average price increase of 30%.

5.4 Predicting the total cost to stay at an Airbnb for 4 nights using our model

We will now use our final model to predict the total cost to stay at an Airbnb for 4 nights, filtering for listings that have at least 10 reviews and an average rating of at least 90.

Preparing the testing data set:

#First we create a new data set accoring to the conditions mentioned above

listings_test_filtered <- listings %>% 
  filter(property_type == "Apartment",
         room_type == "Private room", 
         number_of_reviews >= 10,
         review_scores_rating >= 90
         )

# A quick look at the dimensions of the data frame shows us that we are left with 128 observations that fulfill those conditions

dim(listings_test_filtered)
## [1] 128  45

Predicting prices for our test listings:

#listings_test_filtered <- listings_test_filtered %>% 
#  add_predictions(model3, var="predict_price")


#using the predict function to get create a dataframe that contains the point estimate as well as the prediction interval
prediction_results <- predict(model_final, newdata = listings_test_filtered, interval = c("prediction"))

#as all estimates are still in log, we have to reverse that
prediction_results <-exp(prediction_results)

#we have a quick look at the results table
prediction_results%>%
  kable()%>%
  kable_styling()
fit lwr upr
2519 946 6710
2514 944 6697
2521 946 6715
2889 1085 7697
2505 940 6672
2535 952 6753
2865 1076 7632
2528 949 6734
2849 1070 7589
3383 1270 9012
2512 943 6691
2889 1085 7697
2817 1058 7504
2881 1082 7675
2549 957 6792
3450 1295 9191
2873 1079 7653
2535 952 6753
2535 952 6753
2889 1085 7697
2535 952 6753
2881 1082 7675
2479 931 6607
2857 1073 7611
2512 943 6691
2881 1082 7675
2841 1067 7568
2486 933 6622
2528 949 6734
3421 1284 9114
2865 1076 7632
2491 935 6635
2873 1079 7653
2507 941 6678
2505 940 6672
2881 1082 7675
3421 1284 9114
2535 952 6753
3877 1455 10329
2111 792 5625
2505 940 6672
2479 931 6603
2820 1058 7515
2542 954 6772
2542 954 6772
2865 1075 7634
2881 1081 7677
2833 1064 7547
2549 957 6792
3421 1284 9114
2841 1067 7568
3899 1463 10387
3431 1288 9140
2111 792 5625
2099 788 5593
2507 941 6678
2889 1085 7697
2841 1067 7568
2978 1118 7933
2873 1079 7653
2849 1070 7589
2849 1070 7589
2528 949 6734
2514 944 6697
2505 940 6672
2865 1076 7632
1858 697 4949
2857 1073 7611
2111 792 5625
3910 1468 10416
3373 1266 8987
2833 1064 7547
2068 776 5510
2849 1070 7589
2889 1085 7697
2500 938 6659
2498 938 6653
2873 1079 7653
2849 1070 7589
2535 952 6753
2519 946 6710
3333 1251 8879
2507 941 6678
2449 919 6524
2512 943 6691
2498 938 6653
3174 1191 8459
2528 949 6734
1507 562 4040
2889 1085 7697
2825 1061 7525
2865 1076 7632
2528 949 6734
2528 949 6734
2477 930 6597
1847 693 4921
3361 1262 8954
1815 681 4835
2489 934 6631
2873 1079 7653
2809 1055 7483
2500 938 6659
2857 1073 7611
2105 790 5609
2881 1082 7675
2519 946 6710
1815 681 4835
2111 792 5625
2841 1067 7568
2817 1058 7504
2500 938 6659
2873 1079 7653
2865 1076 7632
2406 903 6410
2809 1055 7483
2080 781 5541
2542 954 6772
2549 957 6792
2491 935 6635
2535 952 6753
2549 957 6792
2549 957 6792
2484 932 6616
1835 689 4890
2074 778 5526
1840 691 4903
2505 940 6672
1832 687 4880

Merging our prediction results with the the listing test data set:

listings_test_filtered_2 <- listings_test_filtered %>% 
  
  #keeping only columns which are relevant for further analysis
  select(-c(2:43, 45)) %>% 
  
  #adding the estimates, upper and lower bounds
  mutate(pred_price_fit = prediction_results[,1],
         pred_price_lwr = prediction_results[,2],
         pred_price_upr = prediction_results[,3],
         
         #creates new variables containing diffrence between estimate and actual value
         pred_difference = pred_price_fit - price_4_nights,
         
         #crates new variable that checks how far off the actual values are from the mean price of the overall data set (mean = 7,799 $)
         difference_mean = mean(listings$price_4_nights) - price_4_nights
         )

listings_test_filtered_2%>%
   kable()%>%
  kable_styling()
id price_4_nights pred_price_fit pred_price_lwr pred_price_upr pred_difference difference_mean
390198 3288 2519 946 6710 -769.2 4511
667072 5750 2514 944 6697 -3236.1 2049
3583417 1099 2521 946 6715 1421.9 6700
4034553 4976 2889 1085 7697 -2086.7 2823
4135729 2828 2505 940 6672 -323.3 4971
4402540 2868 2535 952 6753 -332.9 4931
4577312 3370 2865 1076 7632 -505.0 4429
4613660 2354 2528 949 6734 174.0 5445
4737256 4420 2849 1070 7589 -1571.0 3379
4797434 2178 3383 1270 9012 1204.7 5621
5081314 3758 2512 943 6691 -1246.3 4041
5107965 3361 2889 1085 7697 -471.7 4438
5119748 3052 2817 1058 7504 -234.9 4747
5325479 3865 2881 1082 7675 -983.8 3934
5601379 1980 2549 957 6792 569.4 5819
5923679 1596 3450 1295 9191 1853.9 6203
6070723 2006 2873 1079 7653 867.1 5793
6193802 2972 2535 952 6753 -436.9 4827
6235176 1886 2535 952 6753 649.1 5913
6521201 4274 2889 1085 7697 -1384.7 3525
6918629 4050 2535 952 6753 -1514.9 3749
7449161 6936 2881 1082 7675 -4054.8 863
7608038 2904 2479 931 6607 -424.5 4895
8076245 3772 2857 1073 7611 -915.0 4027
8319060 2196 2512 943 6691 315.7 5603
8479864 4128 2881 1082 7675 -1246.8 3671
8626291 1776 2841 1067 7568 1065.0 6023
9603512 2600 2486 933 6622 -114.2 5199
9730452 2796 2528 949 6734 -268.0 5003
9869093 4274 3421 1284 9114 -853.1 3525
10118408 3268 2865 1076 7632 -403.0 4531
10171027 1388 2491 935 6635 1102.6 6411
10209764 2516 2873 1079 7653 357.1 5283
10370371 2320 2507 941 6678 186.8 5479
10377262 1576 2505 940 6672 928.7 6223
10377744 2604 2881 1082 7675 277.2 5195
10665293 2087 3421 1284 9114 1333.9 5712
11004889 1500 2535 952 6753 1035.1 6299
11561268 2416 3877 1455 10329 1461.0 5383
11613223 2568 2111 792 5625 -456.9 5231
12047000 2200 2505 940 6672 304.7 5599
12311907 1538 2479 931 6603 940.8 6261
12879874 1447 2820 1058 7515 1373.3 6352
13390581 5092 2542 954 6772 -2549.7 2707
13424912 2212 2542 954 6772 330.3 5587
13427320 1946 2865 1075 7634 918.7 5853
13556565 4278 2881 1081 7677 -1397.1 3521
13589317 1804 2833 1064 7547 1029.0 5995
14001895 4324 2549 957 6792 -1774.6 3475
14099119 2141 3421 1284 9114 1279.9 5658
14174713 2196 2841 1067 7568 645.0 5603
14174782 1388 3899 1463 10387 2510.8 6411
14197317 2212 3431 1288 9140 1218.6 5587
14205736 1946 2111 792 5625 165.1 5853
14205988 1530 2099 788 5593 569.2 6269
14295791 2568 2507 941 6678 -61.2 5231
14904443 2162 2889 1085 7697 727.3 5637
15032356 2808 2841 1067 7568 33.0 4991
15513609 2104 2978 1118 7933 873.7 5695
15650593 4784 2873 1079 7653 -1910.9 3015
15793161 2220 2849 1070 7589 629.0 5579
15827702 1596 2849 1070 7589 1253.0 6203
15986335 3054 2528 949 6734 -526.0 4745
16022861 2012 2514 944 6697 501.9 5787
16357189 6032 2505 940 6672 -3527.3 1767
16425395 2204 2865 1076 7632 661.0 5595
16492890 2288 1858 697 4949 -430.5 5511
16715956 2004 2857 1073 7611 853.0 5795
16762435 1868 2111 792 5625 242.9 5931
16891977 2789 3910 1468 10416 1120.8 5010
17289226 2262 3373 1266 8987 1111.2 5537
17487415 2288 2833 1064 7547 545.0 5511
17563634 2568 2068 776 5510 -500.0 5231
17571972 2212 2849 1070 7589 637.0 5587
17915175 2144 2889 1085 7697 745.3 5655
18107701 1174 2500 938 6659 1325.8 6625
18150180 2080 2498 938 6653 417.7 5719
18151764 2022 2873 1079 7653 851.1 5777
18424494 1540 2849 1070 7589 1309.0 6259
18698721 2444 2535 952 6753 91.1 5355
18998259 2004 2519 946 6710 514.8 5795
19093535 4946 3333 1251 8879 -1613.3 2853
19395647 2448 2507 941 6678 58.8 5351
19513486 3952 2449 919 6524 -1503.0 3847
19543437 1746 2512 943 6691 765.7 6053
19668020 832 2498 938 6653 1665.7 6967
20026625 1824 3174 1191 8459 1350.2 5975
20061001 1976 2528 949 6734 552.0 5823
20578280 1463 1507 562 4040 44.5 6336
20595620 2738 2889 1085 7697 151.3 5061
20894159 1180 2825 1061 7525 1645.1 6619
20927864 2074 2865 1076 7632 791.0 5725
21136471 1388 2528 949 6734 1140.0 6411
21184207 3222 2528 949 6734 -694.0 4577
21453143 2412 2477 930 6597 64.7 5387
21547885 1444 1847 693 4921 403.1 6355
21572032 4875 3361 1262 8954 -1514.1 2924
21743669 1520 1815 681 4835 294.7 6279
21775447 1888 2489 934 6631 600.7 5911
21894479 3172 2873 1079 7653 -298.9 4627
22037476 2797 2809 1055 7483 12.2 5002
22512543 1732 2500 938 6659 767.8 6067
22984673 2052 2857 1073 7611 805.0 5747
23240465 2008 2105 790 5609 97.1 5791
23429658 2464 2881 1082 7675 417.2 5335
23460305 1174 2519 946 6710 1344.8 6625
24172166 1528 1815 681 4835 286.7 6271
24887007 2220 2111 792 5625 -109.1 5579
24999034 2299 2841 1067 7568 542.0 5500
26437576 3372 2817 1058 7504 -554.9 4427
26463385 2680 2500 938 6659 -180.2 5119
27039679 1904 2873 1079 7653 969.1 5895
27368942 2430 2865 1076 7632 435.0 5369
27595742 3148 2406 903 6410 -742.2 4651
27943381 2307 2809 1055 7483 502.2 5492
28200071 2776 2080 781 5541 -696.3 5023
30557302 2220 2542 954 6772 322.3 5579
31287691 2152 2549 957 6792 397.4 5647
31478216 1892 2491 935 6635 598.6 5907
31890053 2247 2535 952 6753 288.1 5552
34078010 2000 2549 957 6792 549.4 5799
34348912 1988 2549 957 6792 561.4 5811
34558453 3020 2484 932 6616 -536.3 4779
36925349 1388 1835 689 4890 447.2 6411
37439353 2143 2074 778 5526 -69.1 5656
38962365 1814 1840 691 4903 26.4 5985
39934648 2122 2505 940 6672 382.7 5677
39968107 1463 1832 687 4880 368.6 6336

Summary Statistics:

In this section we will briefly run through some summary statistics to get a feel for how well we predicted the actual price for 4 nights.

# In this table we compare the average 4 night price of the testing data set with the average of our predictions

listings_test_filtered_2 %>% 
  summarize(mean_pred = mean(pred_price_fit),
         mean_actual = mean(price_4_nights))
mean_predmean_actual
2.66e+032.54e+03
# In this table we look at the average amount our predictions differ from the actual listing prices
# We then compare it to the average difference to the overall average price to confirm that our model acts as a better predictor of price than the average

listings_test_filtered_2 %>% 
  summarize(using_model = mean(pred_difference),
            using_mean = mean(difference_mean))
using_modelusing_mean
1185.26e+03

The first table shows that on average, our model slightly underestimates the true mean price of the test data set (diff of 118 USD).

The second table gives us a rough estimate on how our model’s prediction performance as compared to just using the overall mean price of the original data set. while our model slightly overestimates the true values (on average by 118 USD), an prediction model simply using the overall mean price would heavily overstate the prices (on average by 5256 USD)

Plotting

# We visualize the distributions of the actual prices and our predictions


listings_test_filtered_2 %>% 
  
  #in order to plot the data via ggplot, we have to put the data in "tidy" format, using the pivot longer function
  pivot_longer(names_to = "pred_vs_actual", values_to = "value", cols = c(price_4_nights, pred_price_fit)) %>% 
  
  #we can then continue plotting
  ggplot(aes(x=value, fill = pred_vs_actual)) +
  geom_density(alpha = 0.3)

# The second graph investigates how the difference between prediction and actual values is connect to the actual price for 4 nights

ggplot(listings_test_filtered_2, aes(x=price_4_nights, y=pred_difference)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth(se =FALSE, method = lm) +
  labs(title = "The estimation errors seem to be correlated with the price", y="Difference between estimate and actual price", x="Actual price") 

In the first plot we can see the density plot for our estimates and the actual values. The distribution of the actual prices look heavily right skewed, while the model predictions are distributed symetrically.

In the second plot we can see that the prediction error seem to be correlated to the actual price of the listing. The higher the price of the listing, the more our model will underestimate the price.

This could be a sign of a variable currently not included in our model, that would potentially explain this tendency.

Overall however, our final model delivers a good estimate on what how much a couple would have to spend on rent during a 4 day trip to beautiful Cape Town!

6 References

Statista. 2020. Global AC Penetration Rate By Country 2016 | Statista. [online] Available at: https://www.statista.com/statistics/911064/worldwide-air-conditioning-penetration-rate-country/ [Accessed 17 September 2020].